Skip to content

ENH: Add orient=tight format for dictionaries #35292

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Oct 16, 2021
Merged

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Jul 15, 2020

The issue in #4889 contains examples of how the JSON format doesn't support MultiIndex for the index or the columns, independent of the orient chosen. Also, none of the orientations support index names (or MultiIndex names). Finally, if you want to have a JSON format that is "tight" (or "compact"), the split is the closest thing, but it is incomplete due to the indexing issues.

As a first step to addressing this, I have created a orient='tight' option to DataFrame.to_dict() and DataFrame.from_dict() . If we agree that this is a reasonable representation, the next step (which I'd prefer to do in a second PR) would be to also support it with JSON. The challenge there is that to_json() is currently implemented in C. Right now, one can just use json.dumps(df.to_dict(orient='tight')) to get the needed JSON as a workaround.

I'm open to changing the word 'tight' to something else, but I wanted it to start with a different letter than the other orientations.

@WillAyd
Copy link
Member

WillAyd commented Jul 21, 2020

Hmm +/- 0 on adding anything to to_dict here; we already have a lot of formats that kind of do the same things but not really and it's not always spelled out. While this could have some utility I think it adds to that confusion

The challenge there is that to_json() is currently implemented in C.

If the ultimate goal is to include in the table schema most of that is written in Python, so maybe not as much of a concern?

@WillAyd WillAyd added IO Data IO issues that don't fit into a more specific label Needs Discussion Requires discussion from core team before further action labels Jul 21, 2020
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Jul 21, 2020

Hmm +/- 0 on adding anything to to_dict here; we already have a lot of formats that kind of do the same things but not really and it's not always spelled out. While this could have some utility I think it adds to that confusion

The challenge there is that to_json() is currently implemented in C.

If the ultimate goal is to include in the table schema most of that is written in Python, so maybe not as much of a concern?

@WillAyd Here are two important comments in the original issue that this PR addresses:

#4889 (comment) states (and I agree with this observation) that the table format is not compact when it comes to JSON. The size of the 'table' format is not as compact as possible when you need to push JSON around as the payload in a REST API.

#4889 (comment) states the issue with not preserving the index correctly.

@Dr-Irv Dr-Irv marked this pull request as draft September 22, 2020 14:09
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments, looks pretty good if you'd update the notes to 1.2

else:
realdata = data["data"]

def create_index(indexlist, namelist):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ensure_index does this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dr-Irv can you review this comment here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback ensure_index doesn't handle the names of the indexes (or columns), so you'd still have logic to handle that part anyway.

Copy link
Contributor

@jreback jreback Oct 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work
e..g just use Index

In [98]: pd.Index([(1, 0), (1, 1)], names=['foo', 'bar'])
Out[98]: 
MultiIndex([(1, 0),
            (1, 1)],
           names=['foo', 'bar'])

In [99]: pd.Index([1, 2], names=['foo'])
Out[99]: Int64Index([1, 2], dtype='int64')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because note that the argument to set the names of a MultiIndex is names, while for an Index it is name . In your example in cell 99, you lost the name of the index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grr this is a bug, actually is this still true on master (this was a pretty old version i am using)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still an issue. Using pd.Index, with names, you get a FutureWarning. With name, you get a stack trace.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.4.0.dev0+905.gdace93d694'

In [3]: pd.Index([(1,0), (1,1)], names=["foo", "bar"])
<ipython-input-3-99fe0b033485>:1: FutureWarning: Passing keywords other than 'data', 'dtype', 'copy', 'name', 'tupleize_cols' is deprecated and will raise TypeError in a future version.  Use the specific Index subclass directly instead.
  pd.Index([(1,0), (1,1)], names=["foo", "bar"])
Out[3]:
MultiIndex([(1, 0),
            (1, 1)],
           names=['foo', 'bar'])

In [4]: pd.Index([(1,0), (1,1)], name=["foo", "bar"])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-138c3fe61553> in <module>
----> 1 pd.Index([(1,0), (1,1)], name=["foo", "bar"])

c:\Code\pandas_dev\pandas\pandas\core\indexes\base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
    403         from pandas.core.indexes.range import RangeIndex
    404
--> 405         name = maybe_extract_name(name, data, cls)
    406
    407         if dtype is not None:

c:\Code\pandas_dev\pandas\pandas\core\indexes\base.py in maybe_extract_name(name, obj, cls)
   6876     # GH#29069
   6877     if not is_hashable(name):
-> 6878         raise TypeError(f"{cls.__name__}.name must be a hashable type")
   6879
   6880     return name

TypeError: Index.name must be a hashable type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm we removed names=. i still find this a bit of a footgun.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i think we need to handle this a bit bitter if you wouldn't mind opening an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Seems like we ought to have another issue with respect to pd.Index([(1,0), (1,1)], name=["foo", "bar"]) not returning a MultiIndex ??

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Sep 25, 2020

@jreback before I make the change you suggested, when we had the dev sprint on 9/4/20, we discussed this PR, and it was suggested that I add an argument such as preserve_names to to_dict() that would apply to the "split" format rather than introduce the "tight" option.

So what is your opinion about introducing orient="tight" versus orient="split", preserve_names=False (where the latter would be used to preserve compatibility with prior versions)?

@simonjayhawkins
Copy link
Member

@Dr-Irv what's the status here?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Mar 14, 2021

@Dr-Irv what's the status here?

I've been waiting for @jreback to respond to this comment: #35292 (comment)

I would like to introduce orient="tight", but I'm not sure there is agreement on that. If we can agree, I'll move forward with it.

@jreback
Copy link
Contributor

jreback commented Oct 4, 2021

looks ok, can you merge master and will look

@Dr-Irv Dr-Irv marked this pull request as ready for review October 4, 2021 13:53
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Oct 4, 2021

@jreback have merged with master and all checks passed. Changed it from draft status.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good a comment

else:
realdata = data["data"]

def create_index(indexlist, namelist):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dr-Irv can you review this comment here

@jreback jreback merged commit 7f2b418 into pandas-dev:master Oct 16, 2021
@jreback
Copy link
Contributor

jreback commented Oct 16, 2021

thanks @Dr-Irv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants